Introduction

Background

If you are looking to learn about clustering and have some prior experience with data science and R, this article is for you! We are three graduate students at the University of Virginia School of Data Science and we are here to explain the basics of clustering as well as some applications and examples in an easily digestible format.

What is clustering?

Many data science problems revolve around prediction. These problems use training data where the response variable is known and attempt to build a model to predict the response when it is not known. However, some data science problems seek to simply group the data points according to some underlying structure or natural categories that are unknown. These situations are referred to as unsupervised learning because we do not know the true class labels yet we still seek to sort the data into groups. One of the most common approaches to solving this problem is clustering. There are a variety of clustering methods, which we will dig into later, but they all share the same purpose of sorting the data into groups based on features or metrics.

Why clustering?

Clustering is often used to gain insight into the data by revealing patterns and similarities. Data scientists can use the groupings produced by clustering to identify common features of subgroups that may be useful to their domain-specific problem.

Distance Metrics

In order to understand the algorithms used for clustering, you first need a foundational understanding of distance metrics. The distance between two points is used to determine which cluster a data point should belong to. Two distance metrics are most commonly used: Euclidean distance and Manhattan distance. Euclidean distance is the length of the shortest path between points A and B, and it can always be represented by a straight line. A common analogy for Euclidean distance is the path a bird would fly between two points, since it can fly in a straight line. Manhattan distance is also called city block distance because it is the distance between two points when movement is constrained to a grid-like path. As the name suggests, Manhattan distance is the distance you would travel walking between two points in New York City, since you must follow the grid pattern of the roads. In machine learning, Manhattan distance is often preferred when the data has a large number of dimensions, while Euclidean distance is more common for lower-dimensional data. The formula for Manhattan distance is \(d(x,y) = \sum^n_{i=1} |x_i - y_i|\). The formula for Euclidean distance is \(d(x,y) = \sqrt{\sum^n_{i=1}(x_i-y_i)^2}\).
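To make these formulas concrete, here is a small base R sketch (the two points are our own toy example):

```r
# Two toy points in 2-D
x <- c(0, 0)
y <- c(3, 4)

# Euclidean distance: the straight-line, "as the bird flies" path
sqrt(sum((x - y)^2))   # 5

# Manhattan distance: the city-block path along the grid
sum(abs(x - y))        # 7

# The built-in dist() function computes both
dist(rbind(x, y), method = "euclidean")  # 5
dist(rbind(x, y), method = "manhattan")  # 7
```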

Another distance metric that is used is Jaccard distance, which measures how dissimilar two sets are. This metric is an example of a distance used in set theory rather than on a Cartesian plane. We can calculate the Jaccard distance as \(d_J(A,B) = 1 - \frac{|A\cap B|}{|A\cup B|} = 1 - \frac{|A\cap B|}{|A| + |B| - |A\cap B|}\).
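For example, we can compute the Jaccard distance between two small sets in base R (the sets below are our own toy example):

```r
# Jaccard distance: 1 minus the ratio of the intersection size to the union size
jaccard_distance <- function(A, B) {
  1 - length(intersect(A, B)) / length(union(A, B))
}

A <- c("burgers", "fries", "shakes", "salads")
B <- c("fries", "shakes", "coffee")

jaccard_distance(A, B)  # 1 - 2/5 = 0.6
```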

Standardization

Since the distance metric plays a key role in clustering, we want the comparison of distances between different features to be on the same scale. Thus, it is important to standardize the data prior to running clustering. The most common approach to standardizing each feature is to take the observation value and subtract the mean of that feature then divide by the standard deviation. This allows for a more fair comparison of distances while preserving the proportional distance for each feature. The resulting value is the number of standard deviations the value falls from the mean.
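In R, this is exactly what the built-in scale() function does, as in this small sketch (the toy data is our own):

```r
# Two features on very different scales
df <- data.frame(income = c(30000, 45000, 60000, 90000),
                 age    = c(25, 35, 45, 55))

z <- scale(df)   # subtract each column's mean, divide by its standard deviation

colMeans(z)      # both features now have mean ~0
apply(z, 2, sd)  # ...and standard deviation 1
```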

Hard vs Soft Classification

Another important concept to master regarding clustering is hard versus soft classification. Hard classification is a natural way to think about clustering because it takes each data point and assigns it to a cluster or group. Thus, all of the data points receive a single cluster label. Soft classification, on the other hand, assigns each data point a probability of being in each cluster rather than just a single label. In reality, soft classification happens behind the scenes of hard classification and the data point is assigned to the cluster to which it has the greatest probability of belonging. However, it can be useful not to assign these hard labels and rather to look at the probabilities produced by soft classification. By doing so, we are able to better account for the uncertainty regarding our clustering results.
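A small sketch of this idea (the probability matrix below is made up for illustration): soft classification keeps the full probability matrix, while hard classification collapses each row to the single most likely label.

```r
# Soft classification: each row is a point, each column a cluster probability
probs <- matrix(c(0.80, 0.15, 0.05,
                  0.10, 0.30, 0.60,
                  0.40, 0.45, 0.15),
                nrow = 3, byrow = TRUE)

# Hard classification: keep only the most probable cluster for each point
hard_labels <- apply(probs, 1, which.max)
hard_labels  # 1 3 2
```

Note the third point: its probabilities of 0.40 and 0.45 are nearly tied, and the hard label of 2 hides exactly the kind of uncertainty that soft classification preserves.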

Applications

DNA analysis

Methods

K-Means

  • Introduction: K-means clustering is a popular method for unsupervised learning due to its speed of computation relative to other clustering methods. At a high level, K-means is an iterative algorithm that assigns each point to the cluster with the nearest centroid, updates the cluster centroids based on the points now belonging to each cluster, and repeats this process until no point changes cluster from the previous iteration. To accomplish this, we must start with a set of cluster centroids, meaning we must also know the number of clusters we would like to end up with. This is one of the downsides of K-means clustering; however, there are a few heuristics that can be used to select an “optimal” value of k based on the results of the clustering.

  • Approaches: Before examining coding details, it is important to separate our high-level understanding of K-means from its actual implementation. We will walk through two of the most prominent K-means clustering algorithms to highlight the slight differences between the explanation provided above and the most popular modern implementation.

    • Lloyd’s Algorithm (The Classical Approach):

      The first algorithm we will discuss is Lloyd’s Algorithm. Its implementation is essentially the same as the general outline described above. The algorithm follows these steps:

      1. Randomly initialize k points to serve as the initial cluster centroids.
      2. Assign each point to the cluster with the nearest centroid.
      3. Calculate the new centroid of each cluster by averaging the points assigned to it.
      4. Repeat steps 2 and 3 until convergence (meaning the clusters remain the same in two consecutive iterations).
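For illustration, the steps above can be sketched directly in base R. This is a bare-bones teaching implementation of our own; it does not handle edge cases such as empty clusters, and in practice you would use the built-in kmeans() function:

```r
lloyd_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # Step 1: randomly pick k observations as the initial centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # Step 2: assign each point to the cluster with the nearest centroid
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)
    # Step 4: stop at convergence (no point changed cluster)
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 3: recompute each centroid as the mean of its assigned points
    for (j in 1:k)
      centroids[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centroids)
}
```

Because the starting centroids are random, different runs can converge to different solutions.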

    Caution: There is a problem with this algorithm, though: it has a tendency to fall into a local optimum depending on the choice of the random start. This stems from the fact that each point is simply assigned to the nearest centroid, without accounting for how moving the point changes the relationship between its cluster’s centroid and the points that comprise it. This is why another algorithm, Hartigan-Wong K-means, is typically used today.

    • Hartigan-Wong (The Modern Implementation):

      In their implementation of K-means, Hartigan and Wong alter how points are assigned to new clusters. Instead of simply assigning points to the cluster with the nearest centroid, the iterative process evaluates whether a point should be reassigned depending on how the move affects the total within-cluster sum of squares. Each observation is examined by removing it from its current cluster, updating the centroid of the cluster it was removed from, and evaluating which cluster assignment would minimize the total within-cluster sum of squares. This means that a point will not necessarily be assigned to the cluster with the closest centroid. Let’s lay out the steps of this algorithm in a clearer list:

      1. Randomly initialize k points to serve as the initial cluster centroids.
      2. Assign each point to the cluster with the nearest centroid.
      3. Calculate the new centroid of each cluster.
      4. Iterate through each observation, doing the following until convergence:
         a. Remove the point from its current cluster.
         b. Update that cluster’s centroid to account for the removal.
         c. Evaluate which cluster the point should belong to by adding it to each cluster in turn, recalculating that cluster’s centroid, and examining the change in the total within-cluster sum of squares.
         d. Assign the point to the cluster that minimizes the total within-cluster sum of squares.
  • Conclusion: For both algorithms, convergence to a solution is guaranteed, though that solution may be a local optimum rather than the global optimum. Despite this possibility, the Hartigan-Wong algorithm is generally preferred for K-means because it is less likely to get stuck in a local optimum, since it considers the impact of removing a point from its existing cluster before moving it.
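In R, the built-in kmeans() function uses the Hartigan-Wong algorithm by default. A small sketch (the simulated data is our own); the nstart argument reruns the algorithm from several random starts and keeps the best result, which further guards against local optima:

```r
set.seed(42)
# Simulate two well-separated groups of points
X <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))

# algorithm = "Hartigan-Wong" is the default; nstart = 25 tries 25 random starts
km <- kmeans(X, centers = 2, nstart = 25)

km$tot.withinss  # the total within-cluster sum of squares the algorithm minimizes
table(km$cluster)
```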

Hierarchical

Hierarchical clustering is a clustering approach that uses a distance or similarity metric, together with a linkage method, to form clusters of points in n-dimensional space. The distance or similarity metric accounts for how close two points are to one another. The metric with which people are most familiar is Euclidean distance, as it is the one most commonly taught in school; additional metrics, along with their benefits and drawbacks, were explored in the Distance Metrics section above. The linkage method is then used to define a single distance for an entire cluster, because hierarchical clustering often involves joining a single point to an existing cluster, or joining two existing clusters of points together. Once again, there are a variety of options for a linkage method, which will be examined in greater depth below.

The particular type of hierarchical clustering that will be explored is called Agglomerative Clustering, which is the grouping of clusters from the bottom-up. This bottom-up approach means that each point begins as its own cluster and is then grouped together with other points until all points are enclosed in a single cluster. Let’s walk through an initial example before examining the various types of distance and linkage methods.

Hierarchical Clustering Introduction: Types of Restaurants

You have received the following dataset containing information about a number of restaurants that tracks the number of customers that they serviced in the morning and afternoon, respectively. You are tasked with grouping together restaurants that are similar to one another. The data is as follows:

Restaurant Customer Data
Restaurant Name Morning Customers Afternoon Customers
Cup of Joe 150 0
Jonny’s B&B 100 10
McDonald’s 125 1000
Corner Tavern 10 750
Burger King 100 850
Brian’s Spirits 5 500
To figure out how similar these restaurants are, we can plot these points on a graph, with one dimension representing morning customers (our x-axis) and another representing afternoon customers (our y-axis):
Hierarchical Example

To perform hierarchical clustering, we first standardize the two features (as described in the Standardization section, since the afternoon customer counts span a much wider range than the morning counts) and then calculate the pairwise distances between all points. For the sake of simplicity, we will use Euclidean distance, the traditional distance formula most people learned in grade school.
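A quick sketch of this computation in R, using scale() to standardize the two features before taking distances:

```r
restaurants <- data.frame(
  row.names = c("Cup of Joe", "Jonny's B&B", "McDonald's",
                "Corner Tavern", "Burger King", "Brian's Spirits"),
  morning   = c(150, 100, 125, 10, 100, 5),
  afternoon = c(0, 10, 1000, 750, 850, 500))

# Standardize, then compute all pairwise Euclidean distances
d <- dist(scale(restaurants))   # method = "euclidean" is the default
round(as.matrix(d), 4)          # e.g. Corner Tavern vs. Brian's Spirits: 0.5877
```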

Hierarchical clustering begins by merging the two nearest points, which we can also consider to be one point clusters:

Hierarchical Example

From the distance matrix, we can see that the closest distance between two clusters is between McDonald’s & Burger King, which were calculated to be 0.5415 units from one another using our Euclidean distance calculation.

These two restaurants are now considered to be in a cluster together, so when we make subsequent merges, we must decide how to represent the distance from one cluster to another cluster that is now comprised of more than one point. This is where our choice of linkage method comes in. For this example, we will use Single Linkage, meaning that we represent the distance between two clusters as the smallest distance between a point in one cluster and a point in the other (essentially, we take all pairwise distances between points in the two respective clusters and use the minimum of these distances as the overall distance between the clusters).

Hierarchical Example

In the figure above, we illustrate with the red line that the distance from the Corner Tavern cluster to the cluster that now includes both Burger King and McDonald’s is represented by the distance from Burger King to Corner Tavern. This is due to the fact that Corner Tavern is closer to Burger King than to McDonald’s.

We repeat this very same process of updating clusters by merging the two clusters with the smallest distance between them until we have one cluster that contains all of the points.

We can visualize the clustering using something called a dendrogram. A horizontal bar is drawn connecting two clusters at the height corresponding to the level of dissimilarity (represented by distance in this case) at which they are joined. Following the example above, we can see the process for creating the dendrogram as we cluster.
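Continuing the restaurant example, hclust() performs the agglomerative clustering and the result can be plotted directly as a dendrogram (repeating the data here so the snippet is self-contained):

```r
restaurants <- data.frame(
  row.names = c("Cup of Joe", "Jonny's B&B", "McDonald's",
                "Corner Tavern", "Burger King", "Brian's Spirits"),
  morning   = c(150, 100, 125, 10, 100, 5),
  afternoon = c(0, 10, 1000, 750, 850, 500))

# Single-linkage hierarchical clustering on the standardized data
hc <- hclust(dist(scale(restaurants)), method = "single")

# hc$height holds the dissimilarity at which each successive merge happens
plot(as.dendrogram(hc), main = "Single linkage dendrogram")
```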

Hierarchical Example

Note that the lowest black horizontal bar is the one that connects McDonald’s and Burger King and occurs at the height 0.5415. This is the way a cluster is represented in a dendrogram and is equivalent in concept to the black oval placed around the first cluster in the previous image. The lower a black horizontal bar in a dendrogram, the earlier the two clusters it touches were joined. The height at which two clusters are joined represents the distance between them.

Caution: Even if we maintain the method for calculating distance, changing our linkage method could impact the results of the clustering. Below is an example of using another linkage method, complete linkage, when clustering our data.
Hierarchical Example
In this toy example with so few points, the choice of linkage method does not appear to dramatically impact the results of the clustering. The general shape appears to be the same, but if you look closely, the y-axes are different. It is entirely possible to get different clusters when using different linkage methods. Why might this be?

If we are calculating the distance from Cluster 1, which has only one point, to Clusters 2 and 3, which are both comprised of many points, we might find the following two statements to be true simultaneously:

  • Cluster 2’s closest point is closer to Cluster 1 than Cluster 3’s closest point
  • Cluster 3’s furthest point is closer to Cluster 1 than Cluster 2’s furthest point

An example of this can be seen in the image below, viewed with the following in mind: the red point is Cluster 1, the blue points represent Cluster 2, and the green points represent Cluster 3.
Hierarchical Example

If we selected Single Linkage as our method of choice, we would merge Clusters 1 and 2 first, but if we used Complete Linkage, we would merge Clusters 1 and 3 first.

This concept finally brings us to a discussion on the various types of linkage methods, how the specific methods work, and why they might be used in particular scenarios. While this is not an exhaustive list, it does provide many of the most widely applicable methods:

Single Linkage

  • As mentioned in the walk-through above, Single Linkage represents the distance between two clusters as the minimum distance between points in the respective clusters. In other words, we take all pairwise distances between a point in one cluster and a point in the other, and use the smallest of these as the distance between the two clusters.
    Hierarchical Example
    This method generally leads to clusters that form chains of points, meaning that the points within a given cluster can be relatively far apart from one another.

Complete Linkage

Complete Linkage represents the distance between two clusters as the distance between the two furthest points in the respective clusters. This can also be thought of as the maximum of the pairwise distances between a point in one cluster and a point in the other.
Hierarchical Example

Generally, Complete Linkage leads to compact clusters, meaning that points within the same cluster are often relatively close to one another.

Average Linkage

The Average Linkage method represents the distance between two clusters as the average distance between a point in the first cluster and a point in the second. Essentially, all pairwise distances are calculated across the clusters, and the average of these distances is used as the overall distance between the clusters.

Hierarchical Example

Generally, Average Linkage leads to clusters of points that are neither as compact as Complete Linkage nor as far apart as they are in Single Linkage.

Centroid Linkage

Centroid Linkage takes the average of the points in each cluster, giving us a centroid for each cluster, and then uses the distance between the centroids to represent the distance between the two clusters. Centroid Linkage can easily be confused with Average Linkage since both take an average; however, the key difference is when the average is taken. To reiterate: Centroid Linkage averages the points first and then takes the distance between the centroids, whereas Average Linkage takes the pairwise distances between points across the clusters and then averages those distances.

Hierarchical Example

Generally, Centroid Linkage is used when the centroids of the clusters have a particular meaning or when ease of interpretation is needed. It leads to results similar to Average Linkage. Caution: Despite its simplicity, one issue to note with Centroid Linkage is the possibility of inversion. Inversion occurs when two clusters are merged at a smaller distance than two clusters that were already merged earlier, so the merge heights are no longer monotonically increasing. The previous three methods guarantee that there will be no inversion; Centroid Linkage introduces this possibility.

Ward’s Linkage

Ward’s Linkage turns out to extend Centroid Linkage in its calculation; however, it should be thought of as representing the distance between two clusters as the increase in the total within-cluster variation that results from merging them. Simplifying the formula, the distance between two clusters, Cluster A and Cluster B, works out to be:
\(\frac{N_A * N_B}{N_A + N_B} * d^2(C_A, C_B)\), where \(N_A\) is the number of observations in Cluster A, \(N_B\) is the number of observations in Cluster B, \(C_A\) and \(C_B\) are the centroids of Clusters A and B, respectively, and \(d^2\) denotes the squared Euclidean distance between two points (note this is not standard Euclidean distance).
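As a small sketch, the formula above can be computed directly (the toy clusters are our own):

```r
# Ward's linkage distance between two clusters, following the formula above
ward_distance <- function(A, B) {
  cA <- colMeans(A)              # centroid of Cluster A
  cB <- colMeans(B)              # centroid of Cluster B
  nA <- nrow(A); nB <- nrow(B)
  (nA * nB) / (nA + nB) * sum((cA - cB)^2)  # squared Euclidean distance term
}

A <- matrix(c(0, 0,
              1, 1), ncol = 2, byrow = TRUE)  # a cluster of two points
B <- matrix(c(4, 4), ncol = 2)                # a cluster of one point

ward_distance(A, B)  # (2*1)/(2+1) * (3.5^2 + 3.5^2) = 16.33...
```

(Note that R’s hclust() provides both “ward.D” and “ward.D2” variants, which differ in how they treat the input dissimilarities.)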

Hierarchical Example

Thus, Ward’s Linkage can sometimes be similar to Centroid Linkage in terms of the compactness of its clusters; however, because it accounts for the change in within-cluster variation, the results are not always the same. Additionally, Ward’s Linkage is monotonic, meaning that inversion is not an issue as it is in Centroid Linkage clustering.

An Interesting Connection to K-Means: Ward’s Linkage adopts a somewhat similar approach to the Hartigan-Wong K-means algorithm for deciding on merges: using the total within-cluster variation to update cluster assignments. The large difference, however, is that Ward’s Linkage calculates the change in total within-cluster variation from merging all observations in the two existing clusters, rather than evaluating the fit of a single observation in a cluster. Both use the same concept to update clusters, though each implements it in its own way. This is also a friendly reminder that hierarchical clustering makes assignments sequentially, continuing to build upon earlier merges until all observations fall under a single, overarching cluster, whereas K-means iterates through the observations (after initially assigning every observation to a cluster), continuously evaluating whether a point’s placement in another cluster would reduce the total within-cluster variation compared to its current assignment.

Gaussian Mixture Models (GMM)

  • Introduction: Recall that the goal of clustering is to sort the data points into groups according to some features or underlying structure in the data. One way to do this is to use mixture models. Mixture models are a clustering technique that assumes the data comes from two or more distributions and then attempts to sort the data into those distributions. While mixture models can be built with various distribution types, a common approach is to use normal (Gaussian) distributions, in which case the model is referred to as a Gaussian Mixture Model.
  • Choosing the number of distributions: There are multiple ways to choose the number of distributions to use in the mixture model. One method is to use K-means to determine the number of clusters and use this as the number of distributions. This can be done by running K-means for two up to some larger number of clusters, say 20, and plotting the sum of squared errors (SSE) for each value of k. By visually inspecting this plot, we can determine where the SSE starts to level off and does not get much smaller as we increase k; this is the number of clusters we should use. Another approach is to let a GMM algorithm choose the number of distributions. This can be done using the Mclust function from the mclust package, which outputs the number of clusters.
  • The EM Algorithm: Gaussian Mixture Models use the Expectation-Maximization Algorithm to determine the values of the parameters for the distributions in the mixture model. There are two steps in this algorithm: the E-step and the M-step. The E-step calculates the responsibilities, which are the posterior probabilities of the data points coming from each distribution or component. For example, responsibility \(r_{ik}\) is the probability that the data point \(i\) came from distribution \(k\). The M-step uses the responsibilities to estimate the parameters of the distribution. The two steps are done iteratively, starting with the E-step. This means that we must first select values of the parameters \(\theta\) that will be used to calculate the responsibilities in the initial E-step. In this iterative approach, the M-step then uses these responsibilities to recalculate the parameters, which are then used in the next E-step to again calculate the responsibilities. These steps are performed sequentially until they show signs of convergence, which means that the values of the responsibilities and parameters are changing only minimally at each iteration. The result is that we have the values of the parameters for each distribution in our mixture model as well as the probabilities of the points belonging to each of the distributions.
  • Cautions: It is important to note that the EM algorithm depends on the choice of initial values for the parameters in the first E-step. A poor choice for these parameters will lead to poor results from the Gaussian Mixture Model, which may include the EM algorithm converging to a local solution rather than the global solution. One way to account for this is to run the mixture model several times with different initial values and compare the results. It is also helpful to be strategic in choosing the initial parameter values by looking at summary statistics from the data rather than choosing them arbitrarily.
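The elbow heuristic described above can be sketched in a few lines of R; here we use the built-in iris measurements purely as stand-in data:

```r
set.seed(211)
X <- scale(iris[, 1:4])  # standardize the four numeric features

# Run kmeans for k = 2..10 and record the total within-cluster SSE
sse <- sapply(2:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)

# Look for the "elbow" where the SSE stops dropping sharply
plot(2:10, sse, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SSE")
```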

Example

Here we explore a sample dataset based on a dataset from the National Institute of Standards and Technology. The data consists of tabular autosomal Short Tandem Repeats (STRs) at 24 loci, or locations on the genome.

We have separated these profiles into 3 population reference groups: POP1, POP2, and POP3. We want to identify if the STR profiles cluster into distinct population groups.

First, we read in the data, remove NAs and sample 75 observations from each population group to ensure even sample sizes.

library(tidyverse)

Data Cleaning and Sampling

set.seed(211)
setwd("path/to/data")  # set to the folder containing Sample_GenePop_Data.csv
df <- read.csv("Sample_GenePop_Data.csv")
df <- na.omit(df) # remove NA's

## Sample 75 observations from each population
pop1tmp <- df[which(df$Pop=='POP1'), ]
pop2tmp <- df[which(df$Pop=='POP2'), ]
pop3tmp <- df[which(df$Pop=='POP3'), ]

pop1samplerows <- sample(nrow(pop1tmp), size=75)
pop1sample <- subset(pop1tmp[pop1samplerows,])
pop2samplerows <- sample(nrow(pop2tmp), size=75)
pop2sample <- subset(pop2tmp[pop2samplerows,])
pop3samplerows <- sample(nrow(pop3tmp), size=75)
pop3sample <- subset(pop3tmp[pop3samplerows,])

newdf <- rbind(pop1sample, pop2sample)
newdf <- rbind(newdf, pop3sample)
newdf[1:5,1:10]
##      Pop CSF1PO CSF1PO.2 D10S1248 D10S1248.2 D12S391 D12S391.2 D13S317
## 28  POP1     10       12       12         12      17        18      11
## 63  POP1     11       12       14         15      16        21      10
## 31  POP1     10       13       12         13      18        18      12
## 112 POP1      8       10       14         15      17        19      12
## 229 POP1      9       11       12         13      19        19      11
##     D13S317.2 D16S539
## 28         12       9
## 63         11      11
## 31         12       9
## 112        13      12
## 229        14      11
library(plyr)
library(ape)

Dimension Reduction

Next, we reduce the DNA profile data to two dimensions using multidimensional scaling (MDS).

The distance between profiles is first calculated using the dist.gene() function from the package ape. Based on the documentation, this function “computes a matrix of distances between pairs of individuals from a matrix or a data frame of genetic data.” It takes a genetic dataframe as an input, and with “pairwise=TRUE” it calculates the pairwise similarity between profiles in the dataframe. The similarity distance is based on the distance d which is the number of loci for which the profiles differ. The variance is d(L-d)/L where L is the total number of loci.

Multidimensional scaling then takes this set of dissimilarities between profiles and returns a set of points x and y such that the distances between the (x,y) points are approximately equal to the dissimilarities between profiles.

## Set seed for reproducibility
set.seed(211)

## Calculate distance for pairwise DNA profiles
dX = dist.gene(newdf[,-1], method="pairwise") 

## Run multidimensional scaling to obtain x and y coordinates from tabular DNA profile data
fit <- cmdscale(dX, eig=TRUE, k=2) # k is the number of dim

## Split out dimensions to x and y points
newcluster <- fit$points

## Add in population reference group
newcluster <- cbind(as.data.frame(newcluster), Pop=newdf$Pop)

Before running any clustering algorithms, the true population distribution can be seen below.

ggplot(newcluster) + geom_point(aes(x=V1, y=V2, color=Pop)) + labs(color='True Population')

K Means

For our first clustering approach, we run K-means clustering with K=3 on the MDS points to see how well they separate into three distinct groups. We can assess the accuracy using a confusion matrix comparing the cluster groups against the true population groups.

## Run k means
km = kmeans(newcluster[1:2], centers=3, nstart=25)  # choose K=3

## Produce confusion matrix
tab <- table(true=newcluster$Pop, est=km$cluster)

## Map cluster numbers to appropriate population group
mapping1 <- apply(tab,2,which.max)
groups <- mapvalues(km$cluster, c(1,2,3), mapping1) 

## Plot results of kmeans colored by cluster
ggplot(newcluster) + geom_point(aes(x=V1, y=V2, color=as.factor(groups))) + labs(color='Cluster')

## Final accuracy table with correct classification
t <- table(newcluster$Pop, groups)
t
##       groups
##         1  2  3
##   POP1 69  0  6
##   POP2  2 67  6
##   POP3 10 10 55
## Accuracy
sum(diag(t)) /sum(t)
## [1] 0.8488889

The accuracy of K means clustering on the multidimensionally scaled data with K=3 is 0.85.

Hierarchical Clustering

Next, we run hierarchical clustering on the multidimensionally scaled data using the Euclidean distance metric.

We run the hierarchical clustering with 5 different linkage methods: complete, single, centroid, average, and ward.D2.

## Set seed for reproducibility
set.seed(211)

## Calculate euclidean distance for MDS points
dX1 = dist(newcluster[1:2], method="euclidean")

## Run hierarchical clustering for all 5 linkage methods
hc1 = hclust(dX1, method="complete")
hc2 = hclust(dX1, method="single")
hc3 = hclust(dX1, method="centroid")
hc4 = hclust(dX1, method="average")
hc5 = hclust(dX1, method="ward.D2")

## Plot all 5 dendrograms
plot(as.dendrogram(hc1), las=1, main='Complete linkage')

plot(as.dendrogram(hc2), las=1, main='Single linkage')

plot(as.dendrogram(hc3), las=1, main='Centroid linkage')

plot(as.dendrogram(hc4), las=1, main='Average linkage')

plot(as.dendrogram(hc5), las=1, main='Ward linkage')

Of the five dendrograms, ward.D2 produces the most balanced and clearly separated clusters in this context, so we cut that dendrogram at height 90 to evaluate the 3 clusters against the true population groups.

set.seed(211)
## Cut the dendrogram at height 90
clusters <- cutree(hc5, h=90)

## Produce confusion matrix 
tab2 <- table(newdf$Pop, clusters)

## Map cluster numbers to appropriate population group
mapping2 <- apply(tab2,2,which.max)
groups2 <- mapvalues(clusters, c(1,2,3),mapping2) 

## Plot hierarchical clustering colored by cluster
ggplot(newcluster) + geom_point(aes(x=V1, y=V2, color=as.factor(groups2))) + labs(color='Cluster')

## Final confusion matrix with correct classification
t <- table(newdf$Pop, groups2)
t
##       groups2
##         1  2  3
##   POP1 71  0  4
##   POP2  7 64  4
##   POP3 25  3 47
## Accuracy
sum(diag(t)) /sum(t)
## [1] 0.8088889

The accuracy of hierarchical clustering on the multidimensionally scaled data using Euclidean distance and Ward linkage is 0.81.

Gaussian Mixture Modeling (GMM)

Finally, we look at Gaussian Mixture Modeling (GMM), and plot the resulting clusters.

library(mclust)
## Set seed for reproducibility
set.seed(211)

## Run GMM on new data points obtained from MDS
mix = Mclust(newcluster[1:2], verbose=FALSE)

## Plot resulting clusters and uncertainty
plot(mix, what='classification')

plot(mix, what='uncertainty')

The first plot is the raw clustering of the x and y coordinates obtained from MDS, with each group’s corresponding centroids and standard deviations. The second plot shows the uncertainty of each data point, represented by the size of each data point (with smaller data points being the most certain and larger data points having higher uncertainty).

## Find the most likely cluster based on probabilities from GMM
preds <- apply(mix$z,1,which.max)

## Produce confusion matrix against true population groups
tab3 <- table(newdf$Pop,preds)

## Map cluster numbers to appropriate population group
mapping3 <- apply(tab3,2,which.max)
groups3 <- mapvalues(preds, c(1,2,3), mapping3) 

## Final confusion matrix with correct classification
t <- table(newdf$Pop, groups3)
t
##       groups3
##         1  2  3
##   POP1 63  0 12
##   POP2  2 64  9
##   POP3  9  3 63
## Accuracy
sum(diag(t)) /sum(t)
## [1] 0.8444444

The accuracy of GMM on the multidimensionally scaled data is 0.84.

Conclusion

We hope that reading this has given you a solid foundation for using clustering in your work or projects related to data science! We covered the basic concepts of clustering and described some applications. We also dug deeper into three of the most common clustering techniques: K-means, hierarchical clustering, and Gaussian mixture models. Through these in-depth explanations and the step-by-step examples that followed, you should be ready to implement and explain these clustering algorithms yourself. Thank you for reading, and good luck in your future work!

References

Aru, Sanya. What Are the Benefits of Customer Segmentation for Your eCommerce Business?, 2020. https://makewebbetter.com/blog/benefits-customer-segmentation-higher-profitability-holiday-sales/

Carbajal, Maria. Separation and acquisition of two languages in early childhood: A multidisciplinary approach, 2020. https://hal.archives-ouvertes.fr/tel-01948483/

Frost, Jim. Standardization, Statistics by Jim, 2020. https://statisticsbyjim.com/glossary/standardization/

Gohrani, Kunal. Different Types of Distance Metrics Used in Machine Learning, 2019. https://

Gorur, Dilan. Introduction to Clustering, University of California Irvine, 2011. https://www.math.uci.edu/icamp/summer/research_11/gorur/clustering_iCAMP.pdf

Kaiser, Jocelyn. A judge said police can search the DNA of millions of Americans without their consent. What’s next? American Association for the Advancement of Science, 2019. https://www.sciencemag.org/news/2019/11/judge-said-police-can-search-dna-millions-americans-without-their-consent-what-s-next

Kestemont, Michael, et al. Overview of Author Identification Task at PAN-2018, 2018. http://ceur-ws.org/Vol-2125/invited_paper_2.pdf

Machine Learning in Action. Email Spam Filtering: An Implementation with Python and Scikit-learn, KDnuggets. https://www.kdnuggets.com/2017/03/email-spam-filtering-an-implementation-with-python-and-scikit-learn.html

Malliaros, Fragkiskos and Michalis Vazirgiannis. Clustering and Community Detection in Directed Networks, Cornell University, 2013. https://arxiv.org/abs/1308.0971

Maynard, John. 4 ways government program managers can solve the fraud Catch-22, 2019. https://gcn.com/articles/2019/06/03/fraud-analytics.aspx

Morissette, Laurence, and Sylvain Chartier. “The k-means clustering technique: General considerations and implementation in Mathematica.” Tutorials in Quantitative Methods for Psychology 9.1 (2013): 15-24.

Porter, Michael. Clustering, University of Virginia, 2020. https://mdporter.github.io/SYS6018/lectures/06-clustering.pdf

Rosebrock, Adrian. What is the Jaccard Index?, 2016. https://deepai.org/machine-learning-glossary-and-terms/jaccard-index

Sahu, Beeren. Explained: Deeplab for Semantic Segmentation, 2019. https://beerensahu.wordpress.com/2019/02/07/explained-deeplab-for-semantic-segmentation/

Slonim, N. et al. “Hartigan’s K-Means Versus Lloyd’s K-Means - Is It Time for a Change?” IJCAI (2013).

Tibshirani, Ryan. “Clustering 3: Hierarchical clustering (continued); choosing the number of clusters.” Data Mining: 36-462/36-662, January 31, 2013. Carnegie Mellon University. PDF Presentation.

Whang, Joyce and Dhillon, Inderjit. Overlapping Community Detection in Massive Social Networks, Center for Big Data Analytics. https://bigdata.oden.utexas.edu/project/graph-clustering/